Overview

In 2020 the PCCRC funded a project with the following objectives:

  1. Implement an efficient genetic stock identification (GSI) analytical pipeline to retrospectively identify consistent strata (space, time, fishing sector, fish age) for comparison across years
  2. Quantify relationships between Bering Sea chum salmon PSC and biotic/abiotic variables for a suite of different strata (space, time, fishing sector, fish age) and examine how changes in the spatial distribution in the pollock fleet may affect these relationships
  3. Quantify spatial patterns of chum salmon bycatch by age and stock of origin to determine whether near real-time information (e.g., satellite-derived sea surface temperatures) could be used to minimize impact of PSC on Alaska chum salmon stocks.

In addressing the first objective we decided that a through evaluation of the baselines for Chinook and Chum needed to be done. The genetics lab will be using Rubias instead of Bayes for the genetic stock identification (GSI) analyses and moving from a microsatellite baseline for chum to a SNP baseline.

The purpose of this document is to make the analysis as reproducible as possible and serve as a useful way of collaboration among individuals within the lab.

Methods

I don’t have much of the information on tissue sampling, extractions, amplifications so for now I will focus on the statistical analysis.

Statistical Analysis

  1. HWE and LD?
  2. \(F_{st}\) Pairwise Fst values (Nei 1987) were calculated with the package hierfstat.
  3. PCA PCA was done with the package adgenet (import as genid object) and ade4 (the actal duidi.pca)
  4. NJ tree The neighbor joining trees were constructed with ggTree based on the Nei’s Fst
  5. GSI evaluations 6a. Leave one out tests - evaluate how well fish can be assigned to their population of origin. We can compare these with the PofZ values for Realistic fishery simulations. 6b. 100% simulations - A mixture is simulated; all of which originate from the same population or reporting group. 6c. Realistic fishery simulations - 100% simualtions are unrealistic

Chum Microsatellite Baseline

Description of Baseline

The chum microsatellite baseline (allele frequencies) can be downloaded from the Division of Fisheries and Oceans Canada (DFO) Molecular Genetics web page (http://www-sci.pac.dfo-mpo.gc.ca/mgl/data_e.htm) [Beacham 2009b, Beacham2009c]. The allele frequencies were converted to allele counts by Guyon et al. (J. Guyon et al. 2010) to create a BAYES (Pella 2001) baseline file. The program rubias (Moran and Anderson 2019) requires genotypes for individuals. Assuming linkage and Hardy-Weinberg equilibirum, I generated individual genotypes by sampling the allele counts without replacement (https://patbarry6.github.io/ABL_GSI/). This will maintain the allele frequencies and sample sizes for each population. There was a single issue where there were an odd number of alleles scored (Locus 2 in population 240). This issue may be the result of conversion from allele fequencies from DFO to allele counts or just a rounding issues from DFO with an allele frequency. My solution was to add one to the most abundant allele. This does not appreciably change the allele frequency and seems more reasonable that doubling the population size to have an even number of alleles to sample. Doing so would give us an increased confidence in our characterization of that population.

Microsatellite allele from the NOAA lab were converted to allele size calls from DFO by running a set of samples from DFO run on the ABL 3130xl. When TSMRI exports their allele calls they need to convert them to the DFO allele sizes before running GSI on their baseline.

The baseline is composed of 381 populations grouped into six regions: Southeast Asia, Northeast Asia, Western Alaska, Upper/Middle Yukon, Southwest Alaska, and the Eastern Gulf of Alaska/Pacific Northwest (C. A. V. Kondzela C.M. AND Marvin and Guyon 2013). These groupings were selected based on principal coordinate analysis (PCO), NJ trees with Cavalli-Sforza and Edwards chord distances [Beacham2009], and simulations were conducted in SPAM software (100%, equal, and realistic fishery mixture proportions). I wonder if we will be able to split out the E GOA and PNW populations with the new SNP baseline. That might open up a giant can of political worms.

The six regional reporting groups used for GSI for chum salmon

The six regional reporting groups used for GSI for chum salmon

So let’s read in the chum microsatellite baseline formatted for rubias.

ChumUsatBase<-read.csv("./MasterData/ChumBaseline_rubias.csv",stringsAsFactors = F)
RepGroups_ChumUsat<-ChumUsatBase%>%
          group_by(repunit)%>%
          summarise(nPops = length(unique(collection)),
                    nInd = length(unique(indiv)))
  
knitr::kable(RepGroups_ChumUsat)
repunit nPops nInd
E_GOP_PNW 242 32767
EAsia 31 2916
NAsia 31 2804
SW_Alaska 17 1636
UpMidYukon 23 4431
WAlaska 37 6661

For the reporting groups the size of the GOA/PNW greatly exceeds that of the other groups. And Southwest AK has the fewest number of collections. Let’s take a closer look at how man indiviuals represent each collection.

CollectionInfo_ChumUsats<-ChumUsatBase%>%
          group_by(repunit,collection)%>%
          summarise(nInd = length(unique(indiv)))

CollectionInfo_ChumUsats%>%
  group_by(repunit)%>%
  top_n(10)%>%
  knitr::kable()%>%
  kableExtra::kable_styling()
repunit collection nInd
E_GOP_PNW Clapp_Basin 362
E_GOP_PNW Disappearance 325
E_GOP_PNW Inch_Creek 401
E_GOP_PNW Indian_River 341
E_GOP_PNW Kitasoo 408
E_GOP_PNW Lagoon_Inlet 364
E_GOP_PNW McLoughin_Creek 377
E_GOP_PNW Nekite 486
E_GOP_PNW Nimpkish 408
E_GOP_PNW Squakum 385
EAsia Abashiri 130
EAsia Chitose 268
EAsia Gakko_River 159
EAsia Kawabukuro 115
EAsia Shibetsu 110
EAsia Shiriuchi 155
EAsia Tokachi 129
EAsia Tokoro 119
EAsia Tokushibetsu 158
EAsia Yurappu 160
NAsia Amur 339
NAsia Bolshaya 97
NAsia Hairusova 182
NAsia Kikchik 102
NAsia Naiba 147
NAsia Ola_ 119
NAsia Ossora 137
NAsia Pymta 99
NAsia Tugur_River 98
NAsia Vorovskaya 248
SW_Alaska Alogoshak 94
SW_Alaska American_River 95
SW_Alaska Delta_Creek 94
SW_Alaska Frosty_Creek 178
SW_Alaska Gertrude_Creek 99
SW_Alaska Joshua_Green 99
SW_Alaska Moller_Bay 95
SW_Alaska Pumice_Creek 95
SW_Alaska Stepovak_Bay 99
SW_Alaska Volcano_Bay 106
UpMidYukon Big_Creek 196
UpMidYukon Chandalar 225
UpMidYukon Cheena 230
UpMidYukon Fishing_Br 665
UpMidYukon Kluane 535
UpMidYukon Porcupine 325
UpMidYukon Salcha 182
UpMidYukon Sheenjek 256
UpMidYukon Tatchun 170
UpMidYukon Toklat 245
WAlaska Andreafsky 331
WAlaska Eldorado 397
WAlaska Gisasa 294
WAlaska Kobuk 379
WAlaska Kwiniuk_River 269
WAlaska Niukluk 232
WAlaska Pikmiktalik 394
WAlaska Pilgrim_River 478
WAlaska Snake 399
WAlaska Tozitna 353

Within the collections the average size is 134.42 (min = 12 and max = 665). Surprisingly, we have pretty good sample sizes for the Russia (NAsia collections).

HWE & LD

This baseline has been extremely well vetted. It seems like a small waste of time to run these tests right now, but if we decide to make this a short manuscript it might be a good idea to re-run the analyses just to be on the safe side.

\(F_{st}\)

First we need to convert the rubias input file into a genepop input file with a POP designator. This shouldn’t be too difficult. We can just create an index by population and write each population as a group of lines to a text file.

Loci<-colnames(ChumUsatBase)[-(1:4)] %>%
      .[c(TRUE,FALSE)]
Title<-"ChumBaseline"


ChumGenos<-sapply(1:(length(Loci)*2), function(x) formatC(ChumUsatBase[,-(1:4)][,x], width=3,
                                                          format="d",flag="0")) 
colnames(ChumGenos)<-colnames(ChumUsatBase[,-(1:4)])

ChumGenos6<-lapply(Loci, function(x) apply(ChumGenos[,
                                                 grep(pattern=x,
                                                     x=colnames(ChumGenos))],
                                             MARGIN = 1,
                                             paste,
                                             collapse = "")%>%
                     as.vector())

ChumGenosMat<-do.call(cbind,ChumGenos6)
ChumGenosMat[ChumGenosMat == " NA NA"]<-"000000"

collections<-unique(ChumUsatBase$collection)
PopIndex<-lapply((1:length(collections)), function(x) 
                              which(ChumUsatBase$collection==collections[x]))

write.table(Title,"ChumBaseline_gp.gen",append=F,quote=F,row.names=F,col.names = F)
write.table(Loci,"ChumBaseline_gp.gen",append=T,quote=F,row.names=F,col.names = F)

for (x in 1:length(collections)){
  write.table("POP","ChumBaseline_gp.gen",append=T,quote=F,row.names=F,col.names = F)
  TempPop <- data.frame(id=ChumUsatBase[PopIndex[[x]],4],
                      ChumGenosMat[PopIndex[[x]],],stringsAsFactors = F)
  Lines2write <- sapply(1:nrow(TempPop),function(y) 
                   paste(TempPop[y,1], " , " ,
                         paste(as.character(TempPop[y,-1]),collapse=" "),
                         sep=""))
  write.table(Lines2write,"ChumBaseline_gp.gen",append=T,quote=F,row.names=F,col.names = F)
  }

Ok, so now we have a genepop file for the chum microsatellite baseline. Let’s pull it into R with adgenet’s read genepop function and then calculate some Nei’s \(F_{st}\) values. In the markdown document we won’t evaluate this piece everytime becuase the bootstrapping over 300 populations can take some time. Instead we will read in the results at the top of the document.

Chum_uSatBase.genid<-adegenet::read.genepop("./ChumBaseline_gp.gen",ncode=3)
ChumuSatBase_hierf <- hierfstat::genind2hierfstat(Chum_uSatBase.genid)

Chum_uSatFst.Nei<-hierfstat::pairwise.neifst(ChumuSatBase_hierf)

Chum_uSatFst_bs <- hierfstat::boot.ppfst(dat=ChumuSatBase_gdf,nboot=100,quant=c(0.025,0.975),diploid=TRUE)

diag(Chum_uSatFst.Nei)<-0

I originally made a heatmap based on the \(F_{st}\) distances, but it is pretty darn ugly, and a neighbor joining tree uses the same information in a much easier format to digest.

NJ Tree

NJ.chumuSat.Nei<-phangorn::NJ(Chum_uSatFst.Nei)

NJ.chumuSat.Nei$tip.label<-unique(gsub(pattern="_[0-9]+$",replacement="",Chum_uSatBase@pop))

PopRepGroups_ChumuSat<-ChumUsatBase%>%
   group_by(repunit,collection)%>%
   slice(1)%>%
   select(Pop = collection ,RepGroup = repunit)%>%
   ungroup()
 
# p + geom_text2(aes(subset=!isTip, label=node)) #use this line to get the node number

# viewClade(pNei, node=612) #use this line to look at a specific node
  
P_ChumuSat<-ggtree(NJ.chumuSat.Nei) %<+% PopRepGroups_ChumuSat +
  geom_tiplab(aes(fill = factor(RepGroup)),
              size = 1.75,
              color = "black", # color for label font
              geom = "label",  # labels not text
              label.padding = unit(0.15, "lines"), # amount of padding around the labels
              label.size = 0.01) + 
  theme(legend.position = c(0.8,0.2), 
        legend.title = element_blank(), # no title
        legend.key = element_blank())

P_ChumuSat

That looks pretty good. Southeast Asia, Western Alaska, and the Upper/Middle Yukon all look like great groupings. All the collections in the reporting units are included under distinct nodes. The majority of the SW Alaska collections are grouped within a node with GOA/PNW samples; however, there are a few collections from South Bristol Bay that group with the North Asia collections.

Let’s take a closer look at that one

viewClade(P_ChumuSat, node=628)

Ok, so it is a bunch of South Bristol Bay pops that are clustering with the Asia collections. Why would that be? Why wouldn’t we see Asia collections more similar to the Alaska Peninsula pops. The glacial history and the long branch of the Sturgeon River collection from SW Alaska might give us some insight. Sturgeon is on the western part of Kodiak island. The Sturgeon River chum salmon have a unique run timing compared to other chum populations in the Kodiak Archipelago. Sturgeon chum enter the lagoon area earlier (in late May to mid-June) compared to other Kodiak chum salmon stocks (mid-to-late July; (Price 2001)). It is hypothesized that Sturgeon is highly differentiated because the southwest portion of Kodiak island remained ice-free during the last glaical maxima (Petrou et al. 2014) and that the run timing difference has minimized the homogenizing effect of geneflow.

Lets take a look at a principal component analysis (PCA).

PCA

X.Chum_uSat <- scaleGen(Chum_uSatBase.genid,NA.method="mean",center=T,scale=T)# can set the missing to NA, 0, or mean
pca.Chum_uSat <- dudi.pca(X.Chum_uSat,cent=F,scale=F,scannf=FALSE) #nf is kept axes, note you can center and scale in this step
barplot(pca.Chum_uSat$eig[1:15],main="PCA eigenvalues", col=heat.colors(15))

PopRGmap<-data.frame(Pop = as.character(unique(pop(Chum_uSatBase.genid))),
                     RepGroup = as.character(sapply(1:length(unique(pop(Chum_uSatBase.genid))),function(x)
                       ChumUsatBase[grep(unique(pop(Chum_uSatBase.genid))[x],
                                                   x=ChumUsatBase$indiv),2])),
                     stringsAsFactors = F)

Chum_uSat_PCAdf<-data.frame(Ind = dimnames(Chum_uSatBase.genid@tab)[[1]],
                  Pop = pop(Chum_uSatBase.genid),
                  RepGroup = plyr::mapvalues(
                    x=pop(Chum_uSatBase.genid), from = PopRGmap$Pop, 
                    to = PopRGmap$RepGroup),
                  pca.Chum_uSat$li,
                  stringsAsFactors = F) 

Chum_uSat_PCAdf$Pop <- gsub(PCAdf$Pop, pattern="_[0-9]+$", replacement = "" )


ggplot(Chum_uSat_PCAdf) +
  geom_point( aes(Axis1, Axis2,color=as.factor(RepGroup)),size=2,alpha = 1/2)+
  theme(axis.line = element_blank(),
        panel.background = element_blank()) +
  geom_vline(xintercept = 0)+
  geom_hline(yintercept = 0)+
  scale_color_discrete(name = "Reporting Group") 

Chum_uSat_PCA1_Loads<-data.frame(pca.Chum_uSat$c1[order(pca.Chum_uSat$c1[,1]^2,decreasing=T),],
                       pca.Chum_uSat$c1[order(pca.Chum_uSat$c1[,1]^2,decreasing=T),]^2,
                       cumsum(pca.Chum_uSat$c1[order(pca.Chum_uSat$c1[,1]^2,decreasing=T),]^2))
Loads<-head(Chum_uSat_PCA1_Loads,10)%>%
  select(Loadings=CS1,Variance=CS1.1,CumSum=CS1.2)

#Let's look at Ots3 allele freq across the rep groups
OtsInfo<-data.frame(Ind = dimnames(Chum_uSatBase.genid@tab)[[1]],
                    Pop = pop(Chum_uSatBase.genid),
                    RepGroup = plyr::mapvalues(
                      x=pop(Chum_uSatBase.genid), from = PopRGmap$Pop, 
                      to = PopRGmap$RepGroup),
                    Chum_uSatBase.genid@tab[,grep("Ots3",colnames(Chum_uSatBase.genid@tab))]
)

OtsRGcounts<-OtsInfo%>%
  group_by(RepGroup)%>%
  select_at(-(1:2))%>%
  replace(is.na(.), 0)%>%
  summarise_all(funs(sum))

OtsRGfreq<-OtsRGcounts%>%
  {sapply(2:ncol(.),function(x) .[,x]/sum(.[,x]))}%>%
  data.frame(RepGroup = OtsRGcounts$RepGroup,
             .)

AllelesInt<-c(106,086,092,094)

OtsRGfreq%>%
  .[,c(1,sapply(1:length(AllelesInt),function (x) grep(pattern=AllelesInt[x],x=colnames(OtsRGfreq))))]
##     RepGroup   Ots3.106     Ots3.086    Ots3.092    Ots3.094
## 1      EAsia 0.03055813 0.0016109886 0.142421470 0.046413502
## 2      NAsia 0.08847546 0.0074190266 0.101250381 0.008077155
## 3    WAlaska 0.34817746 0.0105138206 0.189082037 0.013984328
## 4 UpMidYukon 0.32620591 0.0002119722 0.434583715 0.001928873
## 5  SW_Alaska 0.03720852 0.0066135323 0.004574565 0.002290536
## 6  E_GOP_PNW 0.16937453 0.9736306597 0.128087832 0.927305606

Similar to the NJ tree, the major division among the reporting groups is between the GOA/PNW collections and all other reporting groups (Axis 1). From the loadings it appears that the locus Ots3 is really informative. The second PC separates Asia from Alaska collections.

GSI evaluations

Self assignment (LOO)

The leave one out test evaluates how well each fish within the baseline can be assigned to their population or reporting group of origin. The test is conducted by removing fish one at a time from each baseline population and then estimating their collection of origin. These types of tests are very useful to see where the most common type of misallocation might be to. It is most easily viewed with a heatmap displaying the probability of assignment of fish back to each collection or reporting group.

sa_ChumuSats <- self_assign(reference = ChumUsatBase, gen_start_col = 5)
## Summary Statistics:
## 
## 51215 Individuals in Sample
## 
## 11 Loci: Oki100, Omm1070, Omy1011, One101, One102, One104, One114, Ots103, Ots3, Otsg68, Ssa419
## 
## 6 Reporting Units: EAsia, NAsia, WAlaska, UpMidYukon, SW_Alaska, E_GOP_PNW
## 
## 381 Collections: Namdae_R, Miomote, Kawabukuro, Hayatsuki, Gakko_River, Uono_River, Ohkawa, Tsugaruishi, Orikasa, Sakari_River, Koizumi_River, Tokachi, Kushiro, Teshio, Chitose, Toshibetsu, Shiriuchi, Shikiu, Yurappu, Shizunai, Tokoro, Abashiri, Horonai, Shari_River, Tokushibetsu, Shibetsu, Nishibetsu, Ryazanovka, Avakumovka, Narva, Suifen, Amur, Naiba, Kalininka, Tym_, Udarnitsa, Tauy, Ola_, Magadan, Okhota, Tugur_River, Oklan, Penzhina, Vorovskaya, Kol_, Pymta, Hairusova, Utka_River, Kikchik, Bolshaya, Plotnikova_R, Kamchatka, Ossora, Nerpichi, Ivashka, Karaga, Dranka, Apuka_River, Olutorsky_Bay, Zhypanova, Anadyr, Kanchalan, Kelly_Lake, Noatak, Imnachuk, Kobuk, Agiapuk, Koyuk, Peel_River, Pikmiktalik, Niukluk, Pilgrim_River, Kwiniuk_River, Snake, Unalakleet, Ungalik, Nome, Shaktoolik, Eldorado, Andreafsky, Chulinak, Tozitna, Anvik, Nulato, Melozitna, Gisasa, Aniak, George, Kanektok, Kasigluk, Kwethluk, Nunsatuk, Goodnews, Stuyahok_River, Mulchatna, Naknek, Togiak, Alagnak, Henshaw_Creek, Jim_River, Koyukuk_south, Koyukuk_late, Cheena, Salcha, Delta, Toklat, Kantishna, Sheenjek, Black_River, Chandalar, Big_Salt, Tatchun, Pelly, Big_Creek, Minto, Kluane, Donjek, Kluane_Lake, Fishing_Br, Porcupine, Teslin, Chandindu, Egegik, Gertrude_Creek, Pumice_Creek, Meshik, Moller_Bay, Joshua_Green, Frosty_Creek, Volcano_Bay, Coleman_Creek, Delta_Creek, Westward_Creek, Stepovak_Bay, Alogoshak, Big_River, American_River, Sturgeon, Uganik, Keta_Creek, Wells_River, Constantine, Olsen_Creek, Fish_Creek, Dipac_Hatchery, Gambier, Wells_Bridge, Herman_Creek, Sawmill, Greens, Kennell, Disappearance, Neets_Bay_late, Neets_Bay_early, Carroll, Nakut_Su, LagoonCr, Takwahoni, Taku, Tuskwa, Yellow_Bluff, Shustnini, Gold_Harbour, Mace_Creek, Botany_Creek, Clapp_Basin, Security, Dawson_Inlet, Mountain_Cr, Fairfax_Inlet, Kano_Inlet_Cr, Steel_Cr, Seal_Inlet_Cr, Thorsen, Pallant, Sedgewick, Lagoon_Inlet, Bag_Harbour, Salmon_River, Little_Goose, Dana_Creek, Pacofi, Surprise, Hutton_Head, Ain_, Awun, Naden, Stanley, Slatechuck_Cre, Government, North_Arm, Tarundl_Creek, Deena, Lagins, Buck_Channel, Honna, Kshwan, Khutzeymateen, Lachmach, Stagoo, Toon, Dak_, Kateen, Kitsault_Riv, Ensheshese, Wilauks_Cr, Illiance, Tseax, Lizard_Cr, Crag_Cr, Stumaun_Cr, Ksedin, Kincolith, Nass_River, Nangeese, Date_Creek, Upper_Kitsumkal, Kitwanga, Ecstall_River, Whitebottom_Cr, Kalum, Dog-tag, Andesite_Cr, Kleanza_Cr, Kispiox, Skeena, Zymagotitz, Wilson_Creek, Markle_Inlet_Cr, Pa-aat_River, Kumealon, Stewart_Cr, Kxngeal_Cr, Klewnuggit_Cr, Bella_Bell, Martin_Riv, Snootli, Cascade, Bella_Coola, Dean_River, Jenny_Bay, Kimsquit, Elcho_Creek, Kemano, Gill_Creek, Turn_Creek, Quaal_River, Kitimat_River, Green_River, Khutze_River, Foch_Creek, Blackrock_Creek, Nias_Creek, West_Arm_Creek, Kitasoo, Clatse_Creek, Quartcha_Creek, McLoughin_Creek, Roscoe_Creek, Mussel_River, Kwakusdis_River, Neekas_Creek, Duthie_Creek, Kainet_River, Barnard, Tyler, Gilttoyee, Lard, Cooper_Inlet, Bullock_Chann, Deer_Pass, Skowquiltz, Frenchman, Hooknose, Nooseseck, Kimsquit_Bay, Bish_Cr, Kiltuish, Salmon_Bay, Cheenis_Lake, Necleetsconnay, East_Arm, Arnoup_Cr, Flux_Cr, Turtle_Cr, MacNair_Creek, Chuckwalla, Clyak, Lockhart-Gordon, Quap, Ashlulm, Draney, Milton, Walkum, Nekite, Klinaklini, Ahta______, Viner_Sound, Waump, Taaltz, Nimpkish, Kakweiken, Glendale, Algard, Ahnuhati, Mackenzie_Sound, Heydon_Cre, Tzoonie, Cheakamus, Sliammon, Mamquam, Wortley_Creek, Squamish, Indian_River, Theodosia, Southgate, Orford, Shovelnose_Cr, Mashiter_Creek, Stawamus, Homathko, Goldstream, Cowichan, Nanaimo, Chemainus, Puntledge, Big_Qual, Little_Qua, Campbell_River, Cold_Creek, Smith_Cree, Demamiel, Nitinat, Hathaway_Creek, Pegattum_Creek, Goodspeed_River, Cayeghle, Colonial, Sugsaw, Nahmint_River, Silverdale, Squakum, Wahleach, Chilliwack, Chehalis, Stave, Alouette, Vedder, Harrison, Inch_Creek, Lower_Lillooet, NorrishWorth, Alouette_North, Widgeon_Slough, Kawkawa, Blaney_Creek, Chilqua_Creek, Serpentine_R, Kanaka_Cr, Worth_Creek, Hopedale_Cr, Hicks_Cr, Harrison_late, Nooksack, Skagit, County_Line, Tulalip, Grant_Creek, Siberia_Creek, SkykomishRiv, Kennedy_Creek, Minter_Cr, GreenRrHatchery, Big_Quilcene, Hoodsport, Salmon_Cr, Elwha, Ellsworth_Cr, Bitter_Creek, Quinault, Satsop
## 
## 7.98% of allelic data identified as missing
sa_to_repu_ChumuSats <- sa_ChumuSats %>%
  group_by(indiv, collection, repunit, inferred_repunit) %>%
  summarise(repu_scaled_like = sum(scaled_likelihood))

RepGroups <- unique(PopRepGroups_ChumuSat$RepGroup)[c(2,3,6,5,4,1)] 


sa_mean_scaled_like_ChumuSats <- sa_to_repu_ChumuSats  %>% 
  mutate(repunit_f = factor(x = repunit, levels = RepGroups)) %>% #turn character into factor
  mutate(inferred_repunit_f = factor(x = inferred_repunit, levels = RepGroups)) %>% #turn character into factor
  group_by(repunit_f, inferred_repunit_f) %>% 
  summarise(mean_repu_scaled_like = mean(repu_scaled_like))  #take the mean of the scaled liklihoods
  
#Generate a heatmap  
  ggplot(sa_mean_scaled_like_ChumuSats,aes(x = repunit_f, y = inferred_repunit_f, z = mean_repu_scaled_like)) +
  geom_tile(aes(fill = mean_repu_scaled_like)) +
    geom_text(aes(label=sprintf(unlist(sa_mean_scaled_like_ChumuSats[,3]),fmt='%#.2f')))+
  scale_fill_gradient(low = "white", high = "red", limits = c(0, 1), name = "Mean Scaled Likelihood\n") +
  theme(axis.text.x = element_text(angle = 90)) +
  xlab("Reporting Unit") +
  ylab("Inferred Reporting Unit")

This is somewhat surprising to me. While the missallocation of SW Alaska samples to the GOA seems reasonable, we have non-trivial misallocation of NE Asia samples to to the GOA/PNW. Let’s take a closer look at the NE Asia baseline samples by subsetting on them, then looking at the collection with the highest likelihood of assignment (we won’t sum over likelihoods for the reporting groups).

MissAlloc<-sa_ChumuSats%>%
  filter(repunit == "NAsia")%>%
  group_by(indiv,collection)%>%
  top_n(n=1,wt=scaled_likelihood)%>%
  filter(inferred_repunit != "NAsia")%>%
  {table(.$collection, .$inferred_collection)}

rowSums(MissAlloc)
##          Amur        Anadyr   Apuka_River      Bolshaya        Dranka 
##            89            29            18            19            16 
##     Hairusova       Ivashka     Kalininka     Kamchatka     Kanchalan 
##            57            16             1            25            22 
##        Karaga       Kikchik          Kol_       Magadan         Naiba 
##            14            39            30            28            39 
##      Nerpichi        Okhota         Oklan          Ola_ Olutorsky_Bay 
##            18            22            10            42            19 
##        Ossora      Penzhina  Plotnikova_R         Pymta          Tauy 
##            50             8            27            46            21 
##   Tugur_River          Tym_     Udarnitsa    Utka_River    Vorovskaya 
##            36            20            10             8            86 
##     Zhypanova 
##            10

The Amur and the Vorovskaya collections have an unusually high number of missallocations. This is likely due to really unequal sample sizes in the baseline collections. Lets take a quick look to confirm this.

 MissAllocR<-rowSums(MissAlloc) / CollectionInfo_ChumUsats[sapply(1:length(rowSums(MissAlloc)),
                                            function(x)
                                            grep(pattern = names(rowSums(MissAlloc))[x],
                                            x = CollectionInfo_ChumUsats$collection,
                                             )),3] 
  
MissAllocDF<-cbind(names(rowSums(MissAlloc)),MissAllocR)
colnames(MissAllocDF)<-c("Pop","Prop")

ggplot(MissAllocDF, aes(x=Prop))+
  geom_histogram()+
  geom_vline(xintercept=MissAllocDF[MissAllocDF$Pop%in%c("Amur","Vorovskaya"),2],
             lty=2,col="red")

So those two populations aren’t particularly bad, they just have a bunch of samples, the collections that are pretty bad are Nerpichi and Pymta.

This doesn’t look good for individual assignment, but for GSI it may all come out in the wash. Let’s do some 100% simulations.

100% simulations

Proof100_scenarios_ChumuSats <- lapply(RepGroups, function(x) tibble(repunit = x, ppn = 1.0))
names(Proof100_scenarios_ChumuSats) <- paste("All", RepGroups, sep = "-")

Proof100_results_ChumuSats <- assess_reference_loo(reference = ChumUsatBase, 
                     gen_start_col = 5, 
                     reps = 10, 
                     mixsize = 100,
                     alpha_repunit = Proof100_scenarios,
                     alpha_collection = 10) #distribution of individuals from collections ~equal

Proof100_results_ChumuSats %>% 
  mutate(repunit_f = factor(x = repunit, levels = unlist(RepGroups))) %>% #[c(2,3,6,5,4,1)]
  group_by(repunit_scenario, iter, repunit_f) %>% 
  summarise(true_repprop = sum(true_pi), repprop_posterior_mean = sum(post_mean_pi), repu_n = sum(n)) %>% 
  mutate(repu_n_prop = repu_n / sum(repu_n)) %>% 
  filter(gsub(pattern="All-","",repunit_scenario) == repunit_f) %>% 
  ggplot(aes(x = iter, y = repprop_posterior_mean, fill = repunit_f)) +
  geom_bar(stat = 'identity') +
  geom_hline(yintercept = 0.9) +
  scale_fill_discrete(name = "Reporting Group") +
  facet_wrap(~ repunit_f) +
  xlab("Iteration") +
  ylab("Posterior Mean Reporting Group Proportion") 

Strange in 100% mixtures we seem to do fine for the NE Asia and SW_Alaska collections but the Upper Middle Yukon looks to have some issues. We probabily are overestimating the Southwest AK proportion in those simulations based on the loo results above.

Proof100_results_ChumuSats%>%
  filter(repunit_scenario == "All-UpMidYukon") %>%
  group_by(iter, repunit)%>%
  summarise(true_repprop = sum(true_pi), repprop_posterior_mean = sum(post_mean_pi), repu_n = sum(n)) %>% 
  mutate(repu_n_prop = repu_n / sum(repu_n)) %>% 
  mutate(repunit_f = factor(x = repunit, levels = unlist(RepGroups))) %>% 
  ggplot(aes(x = repunit, y = repprop_posterior_mean, fill = repunit_f)) +
  geom_boxplot() +
  geom_hline(yintercept = 0.9,lty=2,color="red") +
  scale_fill_discrete(name = "Reporting Group") +
  xlab("Iteration") +
  ylab("Posterior Mean Reporting Group Proportion") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

So for the 100% simulations for the Upper Middle Yukon, we overestimate the proportion for Western Alaska. We will likely never see 100% of a mixture originating from the Upper/Middle Yukon unless we are sampling a fishery on the river, so lets proceed with some realistic fishery simulations that are taken from the BSAI trawl fishery.

Realistic Fishery Simulations

For the realistic fishery simulations we are going to pull the 2017 report for for the 2015 Bering Sea Walleye Pollock Trawl Fishery and Gulf of Alaska Groundfish Fisheries (J. A. V. Kondzela C.M. AND Whittle and Guyon 2017). We will do 10 replicates for mixtures that we might actually see in the fishery.

# Lets instead pull the 2017 analyses that were done and do 10 replicates for
# Mixtures that we might actually see in the fishery

ChumAnalyses<-read_csv("./MasterData/chum_all_years_gsi_summary.csv")

RepGroupMap<-cbind(as.character(RepGroups),
                  unique(as.character(ChumAnalyses$reporting_group))[c(2,4,3,5,6,1)]
)

ChumAnalyses$reporting_group<-plyr::mapvalues(from = RepGroupMap[,2] , 
                                              to = RepGroupMap[,1], 
                                              x= ChumAnalyses$reporting_group )

Chum2017<-ChumAnalyses %>%
  filter(year ==2017) %>%
  mutate(Analysis = paste(lme,spatialstrat,temporalstrat,sep="_")) %>%
  group_by(Analysis)%>%
  select(c(Analysis,reporting_group,mean))

RF_scenarios <- lapply(unique(Chum2017$Analysis), function(x) 
  Chum2017 %>%
    filter(Analysis == x) %>%
    ungroup()%>%
    select(-1)%>%
    setNames(.,c("repunit", "ppn" ))
  )

names(RF_scenarios) <- ChumAnalyses %>%
  filter(year ==2017) %>%
  mutate(Analysis = paste(lme,spatialstrat,temporalstrat,sep="_")) %>%
  select(Analysis) %>%
  unique()%>%
  unlist()

RF_results_ChumUsats <- assess_reference_loo(reference = ChumUsatBase, 
                     gen_start_col = 5, 
                     reps = 10, 
                     mixsize = 100,
                     alpha_repunit = RF_scenarios,
                     alpha_collection = 10) #distribution of individuals from collections ~equal

RFsum_ChumUsats <- RF_results_ChumUsats %>%
  mutate(repunit_f = factor(x = repunit, levels = RepGroups))%>%
  group_by(repunit_scenario, iter, repunit_f) %>% 
  summarise(true_repprop = sum(true_pi), reprop_posterior_mean = sum(post_mean_pi), repu_n = sum(n)) %>% 
  mutate(repu_n_prop = repu_n / sum(repu_n)) 
  

ggplot(RFsum_ChumUsats, aes(x = repu_n_prop, y = reprop_posterior_mean, colour = repunit_f)) +
  geom_point() +
  scale_color_discrete(name = "Reporting Group")+
  geom_abline(intercept = 0, slope = 1)+
  geom_abline(intercept = 0.1, slope = 1,lty=2)+
  geom_abline(intercept = -0.1, slope = 1,lty=2)+
  facet_wrap(~ repunit_f)+
  ylab("Posterior Mean Reporting Group Proportion") +
  xlab("Simulated Mixture Proportion")

That looks pretty good! Lets see how the SNPs measure up!

Chum SNP Baseline

Description of Baseline

There are a few different SNP baselines that ADF&G uses, but the most comprehensive geographically has 91 loci genotyped for 310 populations that are grouped into 9 reporting groups (DeCovich et al. 2012). The main difference between the SNP and microsatellite reporting groups are that the NE and SE Asian reporting groups are merged into a single reporting group and they have broken out some reporting groups from Coastal Western Alaska.

The regional reporting groups for the SNP baseline

The regional reporting groups for the SNP baseline

So the baseline that I recieved from Kyle Shedd was for 91 loci, but there appears to not be complete overlap with the GT_seq panel that Wes is using for the NOAA panel. So we first need to find the overlap in the two panels.

First lets read in the GTseq_RAD_chum SNPs that Wes sent and then look at the overlap with the ADF&G panel.

GTseq287<-read_csv("./MasterData/GTseq_RAD_chum.csv")%>%
            select("Fwd")%>%
            unlist()%>%
            gsub(pattern="_F",replacement="",.)%>%
            as.vector()%>%
            tolower()

ADFG_Panel<-read_lines("./MasterData/WASSIP310Pops91loci.gen")[2:92]%>%
            tolower()%>%
            str_split(pattern="\\.")%>% #this splits the haplotypes
            unlist()%>%
            gsub(pattern="-",replacement="_")

#Full match loci
any(GTseq287%in%ADFG_Panel)
## [1] FALSE
#so lets look for partial matches using grep and the species and gene names
Matches<-lapply(1:91, function(x) c(ADFG_Panel[x],
                           grep(gsub(pattern="_[0-9]+$",
                                   replacement="",
                                   ADFG_Panel[x]),GTseq287,value=T)))

#Now lets pull all the matches in length greater than 1
Matches[which(lapply(FUN=length,X=Matches)>1)]
## [[1]]
## [1] "oke_u200_385"  "oke_u2001_629" "oke_u2003_142" "oke_u2005_62" 
## 
## [[2]]
## [1] "oke_ahr1_78"  "oke_ahr1_278"
## 
## [[3]]
## [1] "oke_u1002_262" "oke_u1002_165"
## 
## [[4]]
## [1] "oke_psmd9_57"  "oke_psmd9_188" "oke_psmd9_057"
## 
## [[5]]
## [1] "oke_cks1_94" "oke_cks1_70"
## 
## [[6]]
## [1] "oke_u507_286" "oke_u507_087"
## 
## [[7]]
## [1] "oke_cks_389" "oke_cks1_70"

Yikes, this looks bad. Only 2 loci look like true overlap between the GTSeq287 panel and the ADF&G baseline. What is going on? Based on what Wes had in his ‘chinook_chum_sockeye_snp.xlsx’ file there should be 63 matches! Lets do a quick manual search of the ‘GTseq_RAD_order_200521 (with coho, steelhead).xlsx’ file that has all the loci named. The first match is ‘Oke_ACOT-100’, so let’s see if that locus is in the file that Wes sent me. Strange it isn’t! So maybe he was working off a different file. Let’s look for the WDFW GT_seq350 SNP panel and see what is going on.

The [GT_seq350 panel] (https://www.psc.org/download/466/information/11005/s15-i09-s14-i17-chum-salmon-southern-area-genetic-baseline-enhancement-part-1-and-part-2-amplicon-development-expanded-baseline-collections-and-genotyping.pdf) is avalable in a PDF so we have to copy and paste into excel to get something to work with that must be what Wes was working off of! Let me pull in those loci and re-run the check for overlapping loci. There are only 287 loci that are in the file that he sent me. So 63 loci are missing… what are the chances that those are the overlapping loci?

GTseq350<-read_csv("./MasterData/WDFW_GTSeq350.csv",col_names=F)%>%
            select(1)%>%
            unlist()%>%
            gsub(pattern="_F",replacement="",.)%>%
            as.vector()%>%
            tolower()

#Full match loci
any(GTseq350%in%ADFG_Panel)
## [1] TRUE
sum(GTseq350%in%ADFG_Panel)
## [1] 61
# so lets look for full and partial matches 
# using grep and the species and gene names
Matches<-lapply(1:91, function(x) c(ADFG_Panel[x],
                           grep(gsub(pattern="_[0-9]+$",
                                   replacement="",
                                   ADFG_Panel[x]),GTseq350,value=T)))

#Now lets pull all the matches in length greater than 1
Matches<-Matches[which(lapply(FUN=length,X=Matches)>1)]
#And lets clean up the matches that are exact 
Index<-which(lapply(FUN=length,X=Matches)>2)

NonExactMatch<-list()
x<-1
for (x in Index){
  if(any(Matches[[x]][1]==Matches[[x]][-1])) {
    Matches[[x]]<-c(Matches[[x]][1], Matches[[x]][which(Matches[[x]][1]==Matches[[x]][-1])+1])
  } else {
  NonExactMatch[[x]]<-Matches[[x]]
  x<-x+1
  }
}
                                                
Matches[which(lapply(FUN=length,X=Matches)>2)]
## [[1]]
## [1] "oke_psmd9_57"  "oke_psmd9_057" "oke_psmd9_188"
PartMat<-which(lapply(FUN=length,X=Matches)>2)


Matches[[PartMat]]<-Matches[[PartMat]][-3]

MatchMat<-matrix(data=unlist(Matches),nrow=length(Matches),ncol=2,byrow=T)
colnames(MatchMat)<-c("ADFG","WDFW")
MatchMat<-MatchMat[-(grep(pattern="cks1|pgap|ahr1",x=MatchMat[,1])),]
rownames(MatchMat)<-paste("Locus",seq(1,nrow(MatchMat),1),sep="_")  

ADFG_Loci2Use<-MatchMat[,1]

knitr::kable(MatchMat,"html")%>%
  kableExtra::kable_styling()#%>%
ADFG WDFW
Locus_1 oke_ppa2_635 oke_ppa2_635
Locus_2 oke_u200_385 oke_u200_385
Locus_3 oke_rfc2_618 oke_rfc2_618
Locus_4 oke_ras1_249 oke_ras1_249
Locus_5 oke_u1017_52 oke_u1017_52
Locus_6 oke_il_1racp_67 oke_il_1racp_67
Locus_7 oke_serpin_140 oke_serpin_140
Locus_8 oke_rh1op_245 oke_rh1op_245
Locus_9 oke_tcp1_78 oke_tcp1_78
Locus_10 oke_il8r2_406 oke_il8r2_406
Locus_11 oke_mlrn_63 oke_mlrn_63
Locus_12 oke_arf_319 oke_arf_319
Locus_13 oke_u2057_80 oke_u2057_80
Locus_14 oke_u2056_90 oke_u2056_90
Locus_15 oke_u2054_58 oke_u2054_58
Locus_16 oke_u2053_60 oke_u2053_60
Locus_17 oke_u2048_91 oke_u2048_91
Locus_18 oke_u2043_51 oke_u2043_51
Locus_19 oke_u2041_84 oke_u2041_84
Locus_20 oke_u2037_76 oke_u2037_76
Locus_21 oke_u2035_54 oke_u2035_54
Locus_22 oke_u2034_55 oke_u2034_55
Locus_23 oke_u2032_74 oke_u2032_74
Locus_24 oke_u2031_37 oke_u2031_37
Locus_25 oke_u2029_79 oke_u2029_79
Locus_26 oke_u2025_86 oke_u2025_86
Locus_27 oke_u2011_107 oke_u2011_107
Locus_28 oke_u2006_109 oke_u2006_109
Locus_29 oke_u1024_113 oke_u1024_113
Locus_30 oke_u1023_147 oke_u1023_147
Locus_31 oke_u1015_255 oke_u1015_255
Locus_32 oke_u1008_83 oke_u1008_83
Locus_33 oke_u1002_262 oke_u1002_262
Locus_34 oke_thic_84 oke_thic_84
Locus_35 oke_sylc_90 oke_sylc_90
Locus_36 oke_slc1a3a_86 oke_slc1a3a_86
Locus_37 oke_rspry1_106 oke_rspry1_106
Locus_38 oke_rab5a_117 oke_rab5a_117
Locus_39 oke_psmd9_57 oke_psmd9_057
Locus_40 oke_nc2b_148 oke_nc2b_148
Locus_41 oke_mgll_49 oke_mgll_49
Locus_42 oke_lamp2_186 oke_lamp2_186
Locus_43 oke_glrx1_78 oke_glrx1_78
Locus_44 oke_f5_71 oke_f5_71
Locus_45 oke_eif4g1_43 oke_eif4g1_43
Locus_46 oke_e2ig5_50 oke_e2ig5_50
Locus_47 oke_dcxr_87 oke_dcxr_87
Locus_48 oke_ccd16_77 oke_ccd16_77
Locus_49 oke_catb_60 oke_catb_60
Locus_50 oke_brp16_65 oke_brp16_65
Locus_51 oke_brd2_118 oke_brd2_118
Locus_52 oke_atp5l_105 oke_atp5l_105
Locus_53 oke_acot_100 oke_acot_100
Locus_54 oke_u507_286 oke_u507_286
Locus_55 oke_u509_219 oke_u509_219
Locus_56 oke_ctgf_105 oke_ctgf_105
Locus_57 oke_u506_110 oke_u506_110
Locus_58 oke_u504_228 oke_u504_228
Locus_59 oke_gpdh_191 oke_gpdh_191
Locus_60 oke_hp_182 oke_hp_182
Locus_61 oke_cks_389 oke_cks_389
Locus_62 oke_u1021_102 oke_u1021_102
  #kableExtra::row_spec(7, bold = T, color = "black", background = "#D7261E")

So looking down the list we see that we have locus pgap-111 in the ADF&G baseline and pgap-92 in the WDFW baseline. In the ADF&G report they state, ‘These SNPs were combined into a phenotypic locus, Oke_pgap-111-92, in preliminary investigations of linkage. However, Oke_pgap-111 was retained and Oke_pgap-92 dropped after the linkage was not shown to be useful in proof tests using the complete baseline.’ It was shown to be linked in 80% of the populations.

Loci oke_ahr1_78 and oke_ahr1_278 are similarly named, but they are not the same locus. ADF&G has both listed, but only 78 is included in the final marker panel. WDFW used oke_ahr1_278.

So we also had a partial match on oke_cks1_94 and oke_cks1_70. It turns out that ADF&G had both loci included in their report, but only 94 was included in the panel. WDFW did the reverse in their panel, they retained oke_cks1_70 and dropped oke_cks1_94. We can remove these.

Wes had loci Oke_CD123_62_F and Oke_CD81-108 as matching. I think this is likely just an error from doing it so quickly. He also had Oke_KPNA2_87_F included by it is not in the final 350GTseq panel for WDFW, but it is used in the ADF&G baseline.

I had three loci that matched between the ADF&G and WDFW baseline that Wes didn’t have (oke_glrx1_78, oke_mgll_49, and oke_u200_385).

Now I need to create an input file with just the 62 loci that overlap

ADFG_BaselineSNPsLoci<-read_lines("./MasterData/WASSIP310Pops91loci.gen")[2:92]%>%
            tolower()%>%
            str_split(pattern="\\.")%>% #this splits the haplotypes
            unlist()%>%
            gsub(pattern="-",replacement="_")

AlleleIndex<-which(ADFG_BaselineSNPsLoci%in%ADFG_Loci2Use == T)[-62]

# Huge note!!!! Locus#90 oke_u1021_102 is a micro-haplotype from two loci. 
# It is actually scored as an integer for the haplotype
# We will probably want to drop it later if we can't figure out how to 
# get rid of the associated locus

ADFG_BaselineSNPsLoci[AlleleIndex]

ADFG_BaselineSNPs<-read_lines("./MasterData/WASSIP310Pops91loci.gen")[-c(1:92)]
PopIndex<-paste(grep("Pop|POP",ADFG_BaselineSNPs)+1,":",c(((grep("Pop|POP",
                               ADFG_BaselineSNPs)[-1])-1),length(ADFG_BaselineSNPs)),sep="")


#So now we want to write out a nicely formatted genepop file
outfile<-"ADFG_62SNPBaseline.gen"
write.table("ADFG_ReducedSNPbaseline",outfile,append=F,quote=F,row.names = F,
            col.names = F)
#I am going to write the loci names that ADF&G uses
write.table(read_lines("./MasterData/WASSIP310Pops91loci.gen")[2:92][AlleleIndex],
            outfile,append=T,quote=F,row.names = F,col.names = F)

for (P in 1:length(PopIndex)){
  write.table("Pop",
            outfile,append=T,quote=F,row.names = F,col.names = F)
  
  PopTemp<-ADFG_BaselineSNPs[(eval(parse(text=PopIndex[P])))]%>%
    str_split(pattern= "\\s+")%>%
    lapply(.,"[",c(1,2,(AlleleIndex+2)))%>%
    lapply(.,paste,collapse = " ")%>%
    unlist()#%>%
    #paste(.,"\n",sep="")
  
  write.table(PopTemp,
            outfile,append=T,quote=F,row.names = F,col.names = F)
}

Let’s take a look at the breakdown of the reporting groups and the populations within each reporting group.

For the reporting groups the size of the GOA/PNW greatly exceeds that of the other groups. And Southwest AK has the fewest number of collections. Let’s take a closer look at how man indiviuals represent each collection.

PopBase<-read.csv("./MasterData/ADFG_ChumSNP_PopNames.csv")[1:310,1:5]

RepGroups_ChumSNPs<-PopBase%>%
          group_by(RepGroup)%>%
          summarise(nPops = length(unique(Location)))%>%
          .[c(1,9,5,6,7,8,3,2,4),]
  
knitr::kable(RepGroups_ChumSNPs)
RepGroup nPops
Asia 35
UpperYukon 24
KotzebueSound 8
NorthernDistrict 11
NorthwestDistrict 7
SouthPenninsula 14
CWAK 64
Chignik/Kodiak 35
EastofKodiak 110

So again we have a lot of populations that are grouped into the East of Kodiak reporting group. There are also very few populations in the Kotzebue Sound, Northern District, Northwet District, and South Penninsula reporting groups.

We are going to evaluate the baseline with these reporting groups.

HWE & LD

Again I am not too concerned with issues of disequilibrium becuase these loci have been so well vetted.

Fst

We are going to follow the same procedure as above with the microsatellites to compute Nei’s \(F_{ST}\) with the package hierfstat. Again, eval for this bit of the code is turned off so we don’t have to wait for the bootstrapping which can take a long time to perform with so many populations.

Chum_SNPsBase_genid<-adegenet::read.genepop("./ADFG_62SNPBaseline.gen",ncode=2)
Chum_SNPsBase_hier <- hierfstat::genind2hierfstat(Chum_SNPsBase_genid)

ChumSNPs_Fst<-hierfstat::pairwise.neifst(Chum_SNPsBase_hier)

ChumSNPs_Fst_bs <- hierfstat::boot.ppfst(dat=Chum_SNPsBase_hier,nboot=100,quant=c(0.025,0.975),diploid=TRUE)

diag(ChumSNPs_Fst)<-0

NJ tree

NJ.chumSNPs<-phangorn::NJ(ChumSNPs_Fst)
#NJ.chumSNPs$tip.label<-PopBase$Location

p3<-ggtree(NJ.chumSNPs) 

Temp<-PopBase[,c(3,1)]

p_ChumSNPs<-p3 %<+% Temp +
  geom_tiplab(aes(fill = factor(RepGroup)),
              size = 1.75,
              color = "black", # color for label font
              geom = "label",  # labels not text
              label.padding = unit(0.15, "lines"), # amount of padding around the labels
              label.size = 0.01) + 
  theme(legend.position = c(0.823,0.8), 
        legend.title = element_blank(), # no title
        legend.key = element_blank())

p_ChumSNPs

The South Penninsula group is spread over multiple parts of the tree, but those collections do group with other Southwest AK collections. There is an interesting group of samples in the middle of the tree that we might want to take a closer look at.

#p_ChumSNPs + geom_text2(aes(subset=!isTip, label=node))
viewClade(p_ChumSNPs, node=378)

Those are the Susitna River collections.

PCA

From the NJ tree PC1 should separate East of Kodiak from all other samples and PC2 might separate the Asia reproting group from Alaska.

X_ChumSNPs <- scaleGen(Chum_SNPsBase_genid,NA.method="mean",center=T,scale=T)# can set the missing to NA, 0, or mean
pca_ChumSNPs <- dudi.pca(X_ChumSNPs,cent=F,scale=F,scannf=FALSE,nf=3) #nf is kept axes, note you can center and scale in this step
barplot(pca_ChumSNPs$eig[1:15],main="PCA eigenvalues", col=heat.colors(15))

PopRGmap<-data.frame(Pop = as.character(PopBase[,3]),
                     RepGroup = as.character(PopBase[,1]),
                     stringsAsFactors = F)

pca_ChumSNPsdf<-data.frame(Ind = dimnames(Chum_SNPsBase_genid@tab)[[1]],
                  Pop = pop(Chum_SNPsBase_genid),
                  RepGroup = plyr::mapvalues(
                    x=pop(Chum_SNPsBase_genid), from = PopRGmap$Pop, 
                    to = PopRGmap$RepGroup),
                  pca_ChumSNPs$li,
                  stringsAsFactors = F) 


pca_ChumSNP1 <- ggplot(pca_ChumSNPsdf) +
  geom_point( aes(Axis1, Axis2,color=as.factor(RepGroup)),size=2,alpha = 1/2)+
  theme(axis.line = element_blank(),
        panel.background = element_blank()) +
  geom_vline(xintercept = 0)+
  geom_hline(yintercept = 0)+
  scale_color_discrete(name = "Reporting Group") +
  theme(legend.position = "none")

pca_ChumSNP2 <- ggplot(pca_ChumSNPsdf) +
  geom_point( aes(Axis2, Axis3,color=as.factor(RepGroup)),size=2,alpha = 1/2)+
  theme(axis.line = element_blank(),
        panel.background = element_blank()) +
  geom_vline(xintercept = 0)+
  geom_hline(yintercept = 0)+
  scale_color_discrete(name = "Reporting Group") 

cowplot::plot_grid(pca_ChumSNP1,pca_ChumSNP2)

PCA_Load_ChumSNPs<-data.frame(pca_ChumSNPs$c1[order(pca_ChumSNPs$c1[,1]^2,decreasing=T),],
                       pca_ChumSNPs$c1[order(pca_ChumSNPs$c1[,1]^2,decreasing=T),]^2,
                       cumsum(pca_ChumSNPs$c1[order(pca_ChumSNPs$c1[,1]^2,decreasing=T),]^2))

head(PCA_Load_ChumSNPs,10)%>%
  select(Loadings=CS1,Variance=CS1.1,CumSum=CS1.2)%>%
  knitr::kable()
Loadings Variance CumSum
Oke_RFC2-618.02 0.2368394 0.0560929 0.0560929
Oke_RFC2-618.01 -0.2368394 0.0560929 0.1121858
Oke_CATB-60.01 0.1963107 0.0385379 0.1507237
Oke_CATB-60.02 -0.1963107 0.0385379 0.1892616
Oke_DCXR-87.02 0.1716159 0.0294520 0.2187136
Oke_DCXR-87.01 -0.1716159 0.0294520 0.2481656
Oke_rab5a-117.02 -0.1652246 0.0272992 0.2754648
Oke_rab5a-117.01 0.1652246 0.0272992 0.3027640
Oke_PPA2-635.02 -0.1543188 0.0238143 0.3265783
Oke_PPA2-635.01 0.1543188 0.0238143 0.3503926
qplot(y=PCA_Load_ChumSNPs$CS1.2)+
  labs(y = "Variance Explained",x = "SNP allele")+
  geom_hline(yintercept = 0.9,lty=2,col="red")

PC1 separates Coastal Western AK and the Yukon from all other collections. It appears that the second PC splits out some of the East of Kodiak collections, but I am not going to dig into this right now. The third PC pulls out the Asia collections. Locus RFC2-618 explains quite a bit of variance in the first PC. It was identified as an outlier locus in Seeb et al. (Seeb et al. 2011). From that paper it looks like locus U502-241 might be similarly effective, but it is not included in this panel.

GSI evaluations

For the GSI evaluations I need to convert the genepop file to rubias.

LOO

source("./functions/Genepop2rubias.R")


####Genepop2rubias
infile="./ADFG_62SNPBaseline.gen"
outfile = "ADFG_62SNPBaselineRubias.csv"
ReportingGroupFile="./ChumSNPs_RepGroups.csv"

Genepop2rubias_baseline(infile=infile,
                        outfile=outfile,
                        digits=2,
                        ReportingGroupFile=ReportingGroupFile)
write.csv(file="ChumSNPs_RepGroups.csv",x=PopRepGroups,row.names = F,quote = F)

ChumSNPsBase<-read.csv(file.path(outfile))
ChumSNPsBase[,1:4]<-lapply(1:4,function(x) as.character(ChumSNPsBase[,x]))


CollectionInfo_ChumSNPs<-ChumSNPsBase%>%
          group_by(repunit,collection)%>%
          summarise(nInd = length(unique(indiv)))


sa_ChumSNPs <- self_assign(reference = ChumSNPsBase, gen_start_col = 5)
## Summary Statistics:
## 
## 32817 Individuals in Sample
## 
## 61 Loci: Oke_ACOT.100, Oke_arf.319, Oke_ATP5L.105, Oke_brd2.118, Oke_brp16.65, Oke_CATB.60, Oke_ccd16.77, Oke_CKS.389, Oke_ctgf.105, Oke_DCXR.87, Oke_e2ig5.50, Oke_eif4g1.43, Oke_f5.71, Oke_glrx1.78, Oke_GPDH.191, Oke_HP.182, Oke_il.1racp.67, Oke_IL8r2.406, Oke_LAMP2.186, Oke_mgll.49, Oke_MLRN.63, Oke_nc2b.148, Oke_PPA2.635, Oke_psmd9.57, Oke_rab5a.117, Oke_ras1.249, Oke_RFC2.618, Oke_RH1op.245, Oke_RSPRY1.106, Oke_serpin.140, Oke_slc1a3a.86, Oke_sylc.90, Oke_TCP1.78, Oke_thic.84, Oke_U1002.262, Oke_U1008.83, Oke_U1015.255, Oke_U1017.52, Oke_U1023.147, Oke_U1024.113, Oke_u200.385, Oke_U2006.109, Oke_U2011.107, Oke_U2025.86, Oke_U2029.79, Oke_U2031.37, Oke_U2032.74, Oke_U2034.55, Oke_U2035.54, Oke_U2037.76, Oke_U2041.84, Oke_U2043.51, Oke_U2048.91, Oke_U2053.60, Oke_U2054.58, Oke_U2056.90, Oke_U2057.80, Oke_U504.228, Oke_U506.110, Oke_U507.286, Oke_U509.219
## 
## 9 Reporting Units: Asia, KotzebueSound, CWAK, UpperYukon, NorthernDistrict, NorthwestDistrict, SouthPenninsula, Chignik/Kodiak, EastofKodiak
## 
## 310 Collections: CMNAMD05F, CMTOKA02, CMKUSH98, CMNISH97, CMABAS98, CMSHIN02, CMTESH01, CMYURA97E, CMYURA97L, CMCHIT03E, CMSASA90, CMSHAR01, CMTOKOR05, CMTSHIB04, CMGAKK03E, CMNAIB95, CMTYM95, CMBOL97, CMPARA98, CMAMU97, CMBIST98, CMHAIR93, CMOZER98, CMPYMT93B, CMPENZ93, CMKOL90B, CMVORO93, CMKAM90, CMPALA98UW, CMMAG91, CMOSS96, CMTAUY90, CMOLA90, CMOKLA93, CMKANC91, CMUDAR94, CMINMA05UW, CMKIAN04UW, CMKOB91, CMKEL91, CMNOA91, CMSEL94, CMAGIA05, CMAMER05, CMELDO05, CMNOME05, CMPIL94, CMSNA95, CMSOL96, CMFISH04, CMKWIN04UW, CMNIUK04, CMTUBU09, CMSHAK05, CMPIKM05, CMKOYU05, CMUNAL04UW, CMUNGA10, CMBLAC06, CMAND93W, CMANDR04EUW, CMCHUL89, CMANV93D, CMANV92B, CMYUKA93, CMKALT92, CMNUL94, CMGIS04, CMCLE02, CMHUS93, CMSFKO96E, CMHENS04, CMMELO03, CMTOZI03, CMKWE94, CMTUL07, CMKIS94, CMANI92, CMSALM07UW, CMHOL95, CMKOG07, CMKAS94, CMGEO07, CMSTO94L, CMNECO07, CMTATL07, CMNUN94, CMTAK07, CMBIGR08, CMSFKU08, CMWINDF08, CMKAN07UW, CMGOOD06NF, CMMEKO06C, CMTOGRM11, CMOSVIAK10UW, CMSUNS06, CMIOW10UW, CMMUL94, CMKOKW11, CMNUS93, CMSTUY93, CMKLUTU10, CMBRIB93, CMALAG10UW, CMWHLMT10UW, CMPUMIC10, CMWAND10, CMHENS95, CMSFKO96L, CMJIM10, CM17MI10F, CMTAN93, CMTOKA94, CMKAN01, CMCHE94, CMSAL01, CMDEL94, CMBLU92, CMBSAL01, CMCHAN01, CMSHE92, CMBLAC95, CMOLDC07, CMFBR07UW, CMKLUA07, CMPEL93, CMMINTS89, CMTATN92, CMBCK95, CMTES92, CMDON94, CMWIGG09UW, CMMES92UW, CMPLEN09UW, CMMESB09UW, CMILNIK02UW, CMNCSEN10UW, CMRHMB09UW, CMLAW92UW, CMCOAL08UW, CMDEERV08UW, CMNELSR08UW, CMMOF96UW, CMJOS09UW, CMFRO09UW, CMALL96UW, CMTRA92UW, CMSTC09UW, CHPET92UW, CMLIJ09UW, CMSANC96UW, CMRUS93UW, CMDEL96UW, CMBEL92UW, CMVOL09UW, CMRUB96UW, CMCAN92UW, CMZAC92UW, CHBAL92UW, CMCOL96UW, CMCHI96UW, CMSTE09UW, CMSTEPR09UW, CMIVAN09UW, CMWESJ93UW, CMWESK93UW, CMWESE93UW, CMAMBM09UW, CMNECR08UW, CMOCEB09UW, CMNAKIL08UW, CMCHIGK09UW, CMKIAL09UW, CMPASS09UW, CMDRYBR09UW, CMWESD93UW, CMWESG93UW, CMBIGRI09UW, CMKARL09UW, CMSTU09UW, CMBSU92UW, CMDEAD09UW, CMSITKI09UW, CMPORTNE09UW, CMBARL09UW, CMWKIL09UW, CHDOG92UW, CMCOXC09UW, CMGULLC09UW, CMEAGLH09UW, CMROUGH09UW, CHAME92UW, CMRUSSI07UW, CHKIZ92UW, CHUGA92UW, CMSPIRU09UW, CMZACR09UW, CMKITB09UW, CMMcN96, CMCHU93, CMSUS96, CMTALK95, CMSPINK08, CMLSUS10UW, CMWILL10, CMCARMLK10, CMWILLIW10, CMBEART10, CMCONS10, CMSIWA10, CMWHN09, CMWEL10, 5P92EKEE, CMPWS95A, CMCHILK06, CMHERM06, CMKLEH06, CMWELLB06, CMSAWM06S, CMFCJUN06S, CMTAKU06F, CMPROS10, CMSANB06S, CMADMI10, CMSWANC10, CMLONG92, CMSALTE92, CMFORD06F, CMSIST06, CMHFHAT06, CMRALPH06, CMSAOB06, CMMEDV09, CMNAKWA06, CMWCRA06, CMKLAH86, CMHARDR10, CMDISA98, CMDISA10F, CMKARTA06, CMLAGO10F, CMDRYB06S, CMNARM87, CMSAGI10, CMSAMPL10, CMCARR09, CMNEET06S, CMNEET06F, CMTRACR06, CMFCHYD06, CMFISH88L, CMHIDIN09, CMNAKA06S, CMSTAG88, CMECST88, CMKITW06UW, CMBAGH89, CMPALL06, CMSALM89, CMSEDG89, CMSUPR89, CMKITA06, CMKITIM06, CMSNOOT06, CMWARC06, CMBIGQU06, CMGOLD06, CMLQUA06, CMPUNTL06, CMCONU06, CMNAHM06, CMNITI06, CMSARI06, CMSOOK06, CMSUGS06, CMNIMP06, CMALOU06, CMINCH06, CMNORR06, CMWEAV06, CMSALM04S, CMLSKA98F, CMSKYK94F, CMSNOQ96, CMUSAU94F, CMELWH04UW, CMDIRU02, CMMILCR94, CMSHER94F, CMSHER94S, CMBMI03F, CMDEW98F, CMDOS03S, CMHAMM05, CMHAM03S, CMLIL06F, CMLIL01S, CMQUIL97, CMUNI04S, CMI205B00F, CMJIM01S, CMJOH94S, CMNISQ04UW, CMKAL03W, CMNCR98F, CMGRA01F, CMLCR94F, CMSATS98, CMSKA02F
## 
## 0.55% of allelic data identified as missing
sa_to_repu_ChumSNPs <- sa_ChumSNPs %>%
  group_by(indiv, collection, repunit, inferred_repunit) %>%
  summarise(repu_scaled_like = sum(scaled_likelihood))

RepGroups <- unique(PopRGmap$RepGroup)

sa_mean_scaled_like_ChumSNPs <- sa_to_repu_ChumSNPs %>% 
  mutate(repunit_f = factor(x = repunit, levels = RepGroups)) %>% #turn character into factor
  mutate(inferred_repunit_f = factor(x = inferred_repunit, levels = RepGroups)) %>% #turn character into factor
  group_by(repunit_f, inferred_repunit_f) %>% 
  summarise(mean_repu_scaled_like = mean(repu_scaled_like))  #take the mean of the scaled liklihoods
  
  
#Generate a heatmap  
  ggplot(sa_mean_scaled_like_ChumSNPs,aes(x = repunit_f, y = inferred_repunit_f, z = mean_repu_scaled_like)) +
  geom_tile(aes(fill = mean_repu_scaled_like)) +
    geom_text(aes(label=sprintf(unlist(sa_mean_scaled_like_ChumSNPs[,3]),fmt='%#.2f')))+
  scale_fill_gradient(low = "white", high = "red", limits = c(0, 1), name = "Mean Scaled Likelihood\n") +
  theme(axis.text.x = element_text(angle = 90)) +
  xlab("Reporting Unit") +
  ylab("Inferred Reporting Unit")

It looks like grouping Kotzebue Sound into Coastal Western Alaska and grouping the Northern District, Northwestern District, South Penninsula, and Chignik/Kodiak collections together into a SWAK group like with the microsatellites.

100% simulations

Proof100_scenarios_ChumSNPs <- lapply(RepGroups, function(x) tibble(repunit = x, ppn = 1.0))
names(Proof100_scenarios_ChumSNPs) <- paste("All", RepGroups, sep = "-")

Proof100_results_ChumSNPs <- assess_reference_loo(reference = ChumSNPsBase, 
                     gen_start_col = 5, 
                     reps = 10, 
                     mixsize = 100,
                     alpha_repunit = SNPProof100_scenarios_ChumSNPs,
                     alpha_collection = 10) #distribution of individuals from collections ~equal

Proof100_results_ChumSNPs %>% 
  mutate(repunit_f = factor(x = repunit, levels = unlist(RepGroups))) %>% 
  group_by(repunit_scenario, iter, repunit_f) %>% 
  summarise(true_repprop = sum(true_pi), repprop_posterior_mean = sum(post_mean_pi), repu_n = sum(n)) %>% 
  mutate(repu_n_prop = repu_n / sum(repu_n)) %>% 
  filter(gsub(pattern="All-","",repunit_scenario) == repunit_f) %>% 
  ggplot(aes(x = iter, y = repprop_posterior_mean, fill = repunit_f)) +
  geom_bar(stat = 'identity') +
  geom_hline(yintercept = 0.9) +
  scale_fill_discrete(name = "Reporting Group") +
  facet_wrap(~ repunit_f) +
  xlab("Iteration") +
  ylab("Posterior Mean Reporting Group Proportion") 

The South Penninsula group looks pretty darn bad. We can look at just those 100% mixtures to see which reporting group is being overestimated.

SNPProof100_results_ChumSNPs%>%
  filter(repunit_scenario == "All-SouthPenninsula") %>%
  group_by(iter, repunit)%>%
  summarise(true_repprop = sum(true_pi), repprop_posterior_mean = sum(post_mean_pi), repu_n = sum(n)) %>% 
  mutate(repu_n_prop = repu_n / sum(repu_n)) %>% 
  mutate(repunit_f = factor(x = repunit, levels = unlist(RepGroups))) %>% 
  ggplot(aes(x = repunit, y = repprop_posterior_mean, fill = repunit_f)) +
  geom_boxplot() +
  geom_hline(yintercept = 0.9,lty=2,color="red") +
  scale_fill_discrete(name = "Reporting Group") +
  xlab("Iteration") +
  ylab("Posterior Mean Reporting Group Proportion") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

When the mixture is composed entirely of fish from the South Penn reporting group we overestimate the contribution from Chignik/Kodiak.

We don’t have any realistic fishery proportions for each of the new reporting groups so I am just going to simulate a bunch of differnet proportions from the Dirichlet. I should be able to simulate proportions for each reporting group that span 0-1.

Realistic Fishery Simulations

#First create some fishery proportions
DirichletProps<-as.data.frame(rbind(MCMCpack::rdirichlet(15,c(rep(0.01,9))),#somefixed
                      MCMCpack::rdirichlet(15,c(rep(0.1,9))),
                      MCMCpack::rdirichlet(15,c(rep(1,9))),
                      MCMCpack::rdirichlet(15,c(rep(10,9)))))

colnames(DirichletProps)<-RepGroups
DirichletProps$Analysis<-1:nrow(DirichletProps)

RFpropsDir<-reshape2::melt(DirichletProps, id=c("Analysis"))


#Make a list of scenarios
RF_scenarios_ChumSNPs <- lapply(unique(RFpropsDir$Analysis), function(x) 
  RFpropsDir %>%
    filter(Analysis == x) %>%
    ungroup()%>%
    select(-1)%>%
    setNames(.,c("repunit", "ppn" ))
  )


names(RF_scenarios_ChumSNPs)<-paste("RF-",seq(1,nrow(DirichletProps)),sep="")


unique(CollectionInfo_ChumSNPs$repunit)[c(1,5,3,9,6,7,8,2,4)]
## [1] "Asia"              "KotzebueSound"     "CWAK"             
## [4] "UpperYukon"        "NorthernDistrict"  "NorthwestDistrict"
## [7] "SouthPenninsula"   "Chignik/Kodiak"    "EastofKodiak"
RF_results_ChumSNPs <- assess_reference_loo(reference = ChumSNPsBase, 
                     gen_start_col = 5, 
                     reps = 10, 
                     mixsize = 100,
                     alpha_repunit = RF_scenarios_ChumSNPs,
                     alpha_collection = 10, #distribution of individuals from collections ~equal
                     return_indiv_posteriors = F)

#for some reason I am getting NAs in the repunits


RFsum_ChumSNPs <- RF_results_ChumSNPs %>%
  mutate(repunit_f = factor(x = repunit, levels = RepGroups[c(1,5,3,9,6,7,8,2,4)]))%>%
  group_by(repunit_scenario, iter, repunit_f) %>% 
  summarise(true_repprop = sum(true_pi), reprop_posterior_mean = sum(post_mean_pi), repu_n = sum(n)) %>% 
  mutate(repu_n_prop = repu_n / sum(repu_n)) 
  

ggplot(RFsum_ChumSNPs, aes(x = repu_n_prop, y = reprop_posterior_mean, colour = repunit_f)) +
  geom_point() +
  scale_color_discrete(name = "Reporting Group")+
  geom_abline(intercept = 0, slope = 1)+
  geom_abline(intercept = 0.1, slope = 1,lty=2)+
  geom_abline(intercept = -0.1, slope = 1,lty=2)+
  facet_wrap(~ repunit_f)+
  ylab("Posterior Mean Reporting Group Proportion") +
  xlab("Simulated Mixture Proportion")

Well these reporting groups don’t look like they are going to work very well with a reduced baseline.

NOAA Reporting Groups

Let’s see what happens when we group these baselines similar to the microsatellite reporting groups. I say similar becuase I am just going to add Kotzebue Sound to the CWAK and then group the four reporting groups on the penninsula and Kodiak togehter to make a SWAK group. We can also break apart the Asia reporting group.

#make a new tibble so we don't mess up the first
ChumSNPsBaseNRG<-ChumSNPsBase

#read in the poprep groups for 
#PoprepGroups<-read.csv("./ChumSNPs_RepGroups.csv",stringsAsFactors = F)

#PoprepGroups$ADFGnames<-unique(ChumSNPsBaseNRG[,3])

ChumSNPsBaseNRG[,2]<-plyr::mapvalues(from=PopBase[,3],
                                    to=as.character(PopBase[,5]),
                                    x=ChumSNPsBaseNRG[,3])

RepGroupsNew<-unique(ChumSNPsBaseNRG$repunit)
RepGroupsOld<-unique(ChumUsatBase$repunit)



RepGroupMap<-cbind(RepGroupsNew,
                  RepGroupsOld
)

#sub in the new names
Chum2017[,2]<-plyr::mapvalues(from=RepGroupMap[,2],
                                     to=RepGroupMap[,1],
                                     x=unlist(Chum2017[,2]))

RF_scenarios <- lapply(unique(Chum2017$Analysis), function(x) 
  Chum2017 %>%
    filter(Analysis == x) %>%
    ungroup()%>%
    select(-1)%>%
    setNames(.,c("repunit", "ppn" ))
  )

names(RF_scenarios) <- ChumAnalyses %>%
  filter(year ==2017) %>%
  mutate(Analysis = paste(lme,spatialstrat,temporalstrat,sep="_")) %>%
  select(Analysis) %>%
  unique()%>%
  unlist()

RF_results_ChumSNPs_NRG <- assess_reference_loo(reference = ChumSNPsBaseNRG, 
                     gen_start_col = 5, 
                     reps = 10, 
                     mixsize = 100,
                     alpha_repunit = RF_scenarios,
                     alpha_collection = 10) #distribution of individuals from collections ~equal

RFsum_ChumSNPs_NRG <- RF_results_ChumSNPs_NRG %>%
  mutate(repunit_f = factor(x = repunit, levels = RepGroupsNew))%>%
  group_by(repunit_scenario, iter, repunit_f) %>% 
  summarise(true_repprop = sum(true_pi), reprop_posterior_mean = sum(post_mean_pi), repu_n = sum(n)) %>% 
  mutate(repu_n_prop = repu_n / sum(repu_n)) 
  

ggplot(RFsum_ChumSNPs_NRG, aes(x = repu_n_prop, y = reprop_posterior_mean, colour = repunit_f)) +
  geom_point() +
  scale_color_discrete(name = "Reporting Group")+
  geom_abline(intercept = 0, slope = 1)+
  geom_abline(intercept = 0.1, slope = 1,lty=2)+
  geom_abline(intercept = -0.1, slope = 1,lty=2)+
  facet_wrap(~ repunit_f)+
  ylab("Posterior Mean Reporting Group Proportion") +
  xlab("Simulated Mixture Proportion")

That looks great! It does as well or better than the microsatellite baseline.

Chinook SNP Baseline

Description of Baseline

The SNP baseline contains genetic information (43 SNP DNA markers) for 23,205 individuals from 172 populations grouped into 11 geographic regions (Templin et al. 2011).

Let’s take a look at the baseline that NOAA has been using.

ChinookSNPBase<-read.csv("./MasterData/ChinookBaseline_rubias.csv",stringsAsFactors = F)
RepGroups_Chinook<-ChinookSNPBase%>%
          group_by(repunit)%>%
          summarise(nPops = length(unique(collection)),
                    nInd = length(unique(indiv)))
  
knitr::kable(RepGroups_Chinook[c(9,3,5,10,6,8,4,7,2,1,11),])
repunit nPops nInd
Russia 4 340
CoastWAK 29 4520
MidYukon 8 1097
UpYukon 13 1734
NAKPen 6 479
NWGOA 19 3212
Copper 11 1181
NEGOA 7 987
CoastSEAK 25 3739
BC 36 4368
WestCoastUS 14 1548

Russia has very few stocks, but at the same time Russia contributes very few fish to the mixtures. All the other reporting groups have a fair number of collections. The Northern AK penninsula has only 6 collections, and Wes has had issues with convergence for this reporting group in the past (Larson et al. 2013), so let’s keep an eye on it.

CollectionInfo_Chinook<-ChinookSNPBase%>%
          group_by(repunit,collection)%>%
          summarise(nInd = length(unique(indiv)))
#knitr::kable(CollectionInfo)

Within the collections the average size is 134.91 (min = 42 and max = 397).

HWE & LD

Again we will breeze by this step.

\(F_{st}\)

We have a cleaned up baseline file in genepop format from some other analyses I was doing. Adegenet doesn’t like haplotype data, so we need to first remove the last locus. Then reimport the file. I created a function awhile back to drop a locus from a genepop file. Let’s source that function and use it to get rid of locus C3N3. We will use it in the baseline evaluations, but drop it for the \(F_{ST}\) NJ tree and PCA plots.

source("./functions/RmGenepopLocus.R")
LocusRM(input="./MasterData/ChinookBaseline_clean.gen",locus2rm='C3N3',NewFileName='ChinookBaseline_42l.gen',digits=2,PopNameAsPrefix=T)

Next we
pulled it into R with adgenet’s read genepop function and then calculate some Nei’s \(F_{st}\) values. In the markdown document we won’t evaluate this piece everytime becuase the bootstrapping over 300 populations can take some time. Instead we will read in the results at the top of the document.

Chinook_SNPBase.genid<-adegenet::read.genepop("./ChinookBaseline_42l.gen",ncode=2)
ChinookSNPBase_hierf <- hierfstat::genind2hierfstat(Chinook_SNPBase.genid)

Chinook_SNPFst.Nei<-hierfstat::pairwise.neifst(ChinookSNPBase_hierf)
Chinook_SNPFst.WC<-hierfstat::pairwise.WCfst(ChinookSNPBase_hierf)

Chinook_SNPFst_bs <- hierfstat::boot.ppfst(dat=ChinookSNPBase_gdf,nboot=100,quant=c(0.025,0.975),diploid=TRUE)

diag(Chinook_SNPFst.Nei)<-0

NJ Tree

A NJ tree was constructed in the package phangorn and visualized with the package ggtree.

NJ.ChinookSNP.Nei<-phangorn::NJ(Chinook_SNPFst.Nei)

NJ.ChinookSNP.Nei$tip.label<-unique(gsub(pattern="\\.[0-9]+$",replacement="",Chinook_SNPBase.genid@pop))


PopRepGroups_ChinookSNP<-ChinookSNPBase%>%
   group_by(repunit,collection)%>%
   slice(1)%>%
   select(Pop = collection ,RepGroup = repunit)%>%
   ungroup()%>%
   mutate(RepGroup_f = factor(x = RepGroup, levels = levels(as.factor(unique(ChinookSNPBase$repunit)))[c(9,3,5,10,6,8,4,7,2,1,11)]))

#p_ChinookSNP + geom_text2(aes(subset=!isTip, label=node))

#viewClade(pNei_ChinookSNP, node=218)
  
pNei_ChinookSNP<-ggtree(NJ.ChinookSNP.Nei) %<+% PopRepGroups_ChinookSNP +
  geom_tiplab(aes(fill = RepGroup_f),
              size = 1.75,
              color = "black", # color for label font
              geom = "label",  # labels not text
              label.padding = unit(0.15, "lines"), # amount of padding around the labels
              label.size = 0.01) + 
  theme(legend.position = c(0.8,0.2), 
        legend.title = element_blank(), # no title
        legend.key = element_blank())


pNei_ChinookSNP

This is an unrooted tree, so all the short branch lengths for coastal WAK look fine. The Mid and Upper Yukon cluster together well. British Columnbia and Coastal Southeast AK are kind of a mess, but that isn’t suprising. For the most part this looks better than expected with 11 reporting groups, but at the same time they have been using this SNP panel for awhile.

PCA

X_Chinook <- scaleGen(Chinook_SNPBase.genid,NA.method="mean",center=T,scale=T)# can set the missing to NA, 0, or mean
pca_Chinook <- dudi.pca(X_Chinook,cent=F,scale=F,scannf=FALSE) #nf is kept axes, note you can center and scale in this step
barplot(pca_Chinook$eig[1:15],main="PCA eigenvalues", col=heat.colors(15))

PopRGmap_Chinook<-data.frame(Pop = as.character(unique(pop(Chinook_SNPBase.genid))),
                     RepGroup = as.character(sapply(1:length(unique(pop(Chinook_SNPBase.genid))),function(x)
                       ChinookSNPBase[grep(unique(pop(Chinook_SNPBase.genid))[x],
                                                   x=ChinookSNPBase$indiv),2])),
                     stringsAsFactors = F)

PCAdf_Chinook<-data.frame(Ind = dimnames(Chinook_SNPBase.genid@tab)[[1]],
                  Pop = pop(Chinook_SNPBase.genid),
                  RepGroup = plyr::mapvalues(
                    x=pop(Chinook_SNPBase.genid), from = PopRGmap_Chinook$Pop, 
                    to = PopRGmap_Chinook$RepGroup),
                  pca_Chinook$li,
                  stringsAsFactors = F) 

PCAdf_Chinook$Pop <- gsub(PCAdf_Chinook$Pop, pattern="_[0-9]+$", replacement = "" )


ggplot(PCAdf_Chinook) +
  geom_point( aes(Axis1, Axis2,color=as.factor(RepGroup)),size=2,alpha = 1/2)+
  theme(axis.line = element_blank(),
        panel.background = element_blank()) +
  geom_vline(xintercept = 0)+
  geom_hline(yintercept = 0)+
  scale_color_discrete(name = "Reporting Group") 

pca.Chinook_Loads<-data.frame(Loci=row.names(pca_Chinook$c1[order(pca_Chinook$c1[,1]^2,decreasing=T),]),
                       pca_Chinook$c1[order(pca_Chinook$c1[,1]^2,decreasing=T),],
                       pca_Chinook$c1[order(pca_Chinook$c1[,1]^2,decreasing=T),]^2,
                       cumsum(pca_Chinook$c1[order(pca_Chinook$c1[,1]^2,decreasing=T),]^2))

Loads<-head(pca.Chinook_Loads,10)%>%
  select(Loadings=CS1,Variance=CS1.1,CumSum=CS1.2)

#pca.Chinook_Loads[grep(pattern="MHC",x=pca.Chinook_Loads[,1]),]

It looks like the major division is just past Coastal SEAK. The second PC then distinguishes the upper and middle Yukon.

GSI evaluations

Self assignment (LOO)

sa_ChinookSNPs <- self_assign(reference = ChinookSNPBase, gen_start_col = 5)
## Summary Statistics:
## 
## 23205 Individuals in Sample
## 
## 43 Loci: ARF, AsnRS72, C3N3, ETIF1A, FARSLA220, FGF6A, GH2, GPDH, GPH318, GST207, GST375, GTH2B550, HGFA, hnRNPL533, HSP90B100, IGF191, IK1328, IL1RA, LEI292, MHC1, MHC2, NOD1, NRP, OPLW173, OPSW152, P450, Prl2, PrpI120, RAG3, RFC2, S71, SClkF2, SERPC1209, SL, TAPBP, Tnsf, U200167, U211, U212297, UNKN4150, UNKN6187, xKER137, zP3b
## 
## 11 Reporting Units: Russia, CoastWAK, MidYukon, UpYukon, NAKPen, NWGOA, Copper, NEGOA, CoastSEAK, BC, WestCoastUS
## 
## 172 Collections: KBIST98L, KBOLS0298E, KKAMC98L97, KPAKH02, KPILG0506, KUNAL05, KGOLS0506, KANDR0203, KANVI02, KGISA01, KTOZI0203, KHENS01, KSFKOY03, KBEAV97, KCHAN020304, KSHEE020406, KKANT05, KCHENA01, KSALC05, KCHAU000103, KKLON010395, KSTEW97, KMAYO039297, KBLIN03, KPELL9697, KLSAL8797, KBIGS8797, KTATC87970203, KNORD03, KNISU8797, KTAKH039702, KWHITE858797, KGOMF050693, KAROL05, CHKAN929305, KEEK0205, KKWET01, KKANE01KISA05, KITUL9394TULU05, KSALM0206, KGEOR0205, CHKOG929305, KISTO94, KCHEE0206, KGAGA06, KIMCG94TAKW05, KTATL0205, KSALM95, KITOG9394, CHNUU92KINUS93, KIMUL94, KISTU9394, KNEK95MAIN04, KBIGCK04, KKSALM06AB, KMESH06, KMILK06, KNELS06, KBLACH06, KSTEE06, KCHIG9506, KIAYA93AYAK06, KIKAR9306, KDESH05MOOD95, CHDEC91112, KWILL05, KPRAI95, KTALA95, KCRESC06, KKILL0506, KBENJ0506, KFUNN0506, KSLIK05, KJUNE0506, KKENA0403M06, KCROCK0592, KKASL05M05, KANCH06, KNINI06, KINDI0405, KBONE0405, KCHIS04E, KOTTER05, KSINO0504, KGULK04MMFPF, KMEND04, KKIAN04, KMANK0405, KGRAY04LTON0405, KTEBA040506, KSITU90889192, KKLUK8990, CHBIG92930495, CHTAH9204, KDIPAC05, KKELS04, KKSR899093, KKING03, KCHIK0390, KICHI93LPWC05, CHWHI929805, KHUMP03, KBUTL04, KCLEA030489, KCRIP0388, KGENE030489, KKERR0304, KLPWU05, CHDMT92DEER94, KKETA8903, KBLOS04, KANDR8904, CHCRY929405, KMEDV9805, KHIDD9498, KMACA05, KKOWA8990, KLTAT899005, KUNAH8990, KNAKI8990, KDUDI05, KTAHL89, KKATEE05, KDAMD96, KKINC96, KKWINA96, KOWEG96, KBULK99, KSUST01, KECST0102, KLKALU01, KLATN96, KKITI97, KWANN96, KKLIN97, KPTCV03, KCONU9798, KMARB009699, KNITI96, KROBE9603, KSARI0197, KBIGQU96, KNANA02, KQUIN96, KMORK01, KSALBC97, KTORP01, KCHIL02959699, KNECH96, KQUES96, KSTUA96, KCLWA97, KLOUI01, KLADA96, KLTHO01, KMSHU8697, KBIRK0102039799, KHARR02, KMAKA01F03F, KFORK05, KUSKAG06SU, KSOOS04F, KLYON0203, KHANF000406, KLDES02, KCARS01, KMCKE04, KALSE04, KSIUS01, KSRF90SRS90GHF06, KEELR00001A, KSACW05
## 
## 3.44% of allelic data identified as missing
sa_to_repu <- sa_ChinookSNPs %>%
  group_by(indiv, collection, repunit, inferred_repunit) %>%
  summarise(repu_scaled_like = sum(scaled_likelihood))

RepGroups <- unique(PopRepGroups_ChinookSNP$RepGroup) [c(9,3,5,10,6,8,4,7,2,1,11)]

sa_mean_scaled_like <- sa_to_repu %>% 
  mutate(repunit_f = factor(x = repunit, levels = RepGroups)) %>% #turn character into factor
  mutate(inferred_repunit_f = factor(x = inferred_repunit, levels = RepGroups)) %>% #turn character into factor
  group_by(repunit_f, inferred_repunit_f) %>% 
  summarise(mean_repu_scaled_like = mean(repu_scaled_like))  #take the mean of the scaled liklihoods
  
#Generate a heatmap  
  ggplot(sa_mean_scaled_like,aes(x = repunit_f, y = inferred_repunit_f, z = mean_repu_scaled_like)) +
  geom_tile(aes(fill = mean_repu_scaled_like)) +
    geom_text(aes(label=sprintf(unlist(sa_mean_scaled_like[,3]),fmt='%#.2f')))+
  scale_fill_gradient(low = "white", high = "red", limits = c(0, 1), name = "Mean Scaled Likelihood\n") +
  theme(axis.text.x = element_text(angle = 90)) +
  xlab("Reporting Unit") +
  ylab("Inferred Reporting Unit")

This doesn’t look too bad. Most of the missallocations are to neighboring reporting groups. Chinook probably have a pretty high isolation by distance signal.

100% simulations

Proof100_scenarios.ChinookSNPS <- lapply(RepGroups, function(x) tibble(repunit = x, ppn = 1.0))
names(Proof100_scenarios.ChinookSNPS) <- paste("All", RepGroups, sep = "-")

Proof100_results.ChinookSNPS <- assess_reference_loo(reference = ChinookSNPBase, 
                     gen_start_col = 5, 
                     reps = 10, 
                     mixsize = 100,
                     alpha_repunit = Proof100_scenarios.ChinookSNPS,
                     alpha_collection = 10) #distribution of individuals from collections ~equal

Proof100_results.ChinookSNPS %>% 
  mutate(repunit_f = factor(x = repunit, levels = unlist(RepGroups))) %>% #[c(2,3,6,5,4,1)]
  group_by(repunit_scenario, iter, repunit_f) %>% 
  summarise(true_repprop = sum(true_pi), repprop_posterior_mean = sum(post_mean_pi), repu_n = sum(n)) %>% 
  mutate(repu_n_prop = repu_n / sum(repu_n)) %>% 
  filter(gsub(pattern="All-","",repunit_scenario) == repunit_f) %>% 
  ggplot(aes(x = iter, y = repprop_posterior_mean, fill = repunit_f)) +
  geom_bar(stat = 'identity') +
  geom_hline(yintercept = 0.9) +
  scale_fill_discrete(name = "Reporting Group") +
  facet_wrap(~ repunit_f) +
  xlab("Iteration") +
  ylab("Posterior Mean Reporting Group Proportion") 

The 100% mixtures look almost too good to be true! Let’s pull in some realistic fishery proportions from an old tech memo so that we can evaluate scenarios that we might encounter when performing GSI on the trawl bycatch.

Realistic Fishery Simulations

I pulled the 2015 mixture proportions for the 2015 Bering Sea Walleye Pollock (Gadus chalcogrammus) Trawl Fishery (Guthrie III et al. 2017).

ChinookAnalyses<-read_csv("./MasterData/Chinook_2017_gsi.csv")

ChinookAnalysesM<-reshape2::melt(ChinookAnalyses,id="Analysis")
colnames(ChinookAnalysesM)[c(2,3)]<-c("RepGroup","Prop")


RF_scenarios_Chinook <- lapply(unique(ChinookAnalyses$Analysis), function(x)
  ChinookAnalysesM %>%
    filter(Analysis == x) %>%
    ungroup()%>%
    select(-1)%>%
    setNames(.,c("repunit", "ppn" ))
  )

names(RF_scenarios_Chinook) <- ChinookAnalyses$Analysis 

RF_results_Chinook <- assess_reference_loo(reference = ChinookSNPBase, 
                     gen_start_col = 5, 
                     reps = 15, 
                     mixsize = 100,
                     alpha_repunit = RF_scenarios_Chinook,
                     alpha_collection = 10) #distribution of individuals from collections ~equal




RFsum_Chinook <- RF_results_Chinook %>%
  mutate(repunit_f = factor(x = repunit, levels = RepGroups))%>%#[c(2,3,6,5,4,1)]))%>%
  group_by(repunit_scenario, iter, repunit_f) %>% 
  summarise(true_repprop = sum(true_pi), reprop_posterior_mean = sum(post_mean_pi), repu_n = sum(n)) %>% 
  mutate(repu_n_prop = repu_n / sum(repu_n)) 
  

ggplot(RFsum_Chinook, aes(x = repu_n_prop, y = reprop_posterior_mean, colour = repunit_f)) +
  geom_point() +
  scale_color_discrete(name = "Reporting Group")+
  geom_abline(intercept = 0, slope = 1)+
  geom_abline(intercept = 0.1, slope = 1,lty=2)+
  geom_abline(intercept = -0.1, slope = 1,lty=2)+
  facet_wrap(~ repunit_f)+
  ylab("Posterior Mean Reporting Group Proportion") +
  xlab("Simulated Mixture Proportion")

That looks pretty darn good.

References

DeCovich, N., T.H. Dann, S.D. Rogers Olive, H. L. Liller, E.K.C. Fox, J.R. Jasper, E.L. Chenoweth, C. Habicht, and W.D. Templin. 2012. “Chum Salmon Baseline for the Western Alaska Salmon Stock Identification Program.” Special Publication 12-26. Alaska Department of Fish; Game.

Guthrie III, C.M., Hv.T. Nguyen, A.E. Thomson, and J.R. and Guyon. 2017. “Genetic Stock Composition Analysis of the Chinook Salmon Bycatch from the 2015 Bering Sea Walleye Pollock (Gadus Chalcogrammus) Trawl Fishery.” NOAA Technical Memorandum NMFS-AFSC-342. 17109 Pt. Lena Loop Rd. Juneau, Alaska 99801: National Oceanic; Atmospheric Administration.

Guyon, J.R., C.M. Kondzela, T McCraney, C. Marvin, and E. Martinson. 2010. “Genetic Stock Composition Analysis of Chum Salmon Bycatch Samples from the 2005 Bering Sea Groundfish Fishery.” 605 W. 4th Ave, Anchorage, Alaska: North Pacific Fishery Management Council.

Kondzela, C.T. AND Vulstek, C.M. AND Marvin, and J.R. Guyon. 2013. “Genetic Stock Composition Analysis of Chum Salmon Bycatch Samples from the 2011 Bering Sea Walleye Pollock Trawl Fishery.” NOAA Technical Memorandum NMFS-AFSC-243. NOAA.

Kondzela, J.A. AND Vulstek, C.M. AND Whittle, and J.R. Guyon. 2017. “Genetic Stock Composition Analysis of Chum Salmon Bycatch Samples from the 2011 Bering Sea Walleye Pollock Trawl Fishery.” NOAA Technical Memorandum NMFS-AFSC-345. NOAA.

Larson, W.A., F.M. Utter, K.W. Myers, W.D. Templin, J.E. Seeb, C.M. Guthrie III, A.V. Bugaev, and L.W. Seeb. 2013. “Single-Nucleotide Polymorphisms Reveal Distribution and Migration of Chinook Salmon (Oncorhynchus Tshawytscha) in the Bering Sea and North Pacific Ocean.” Canadian Journal of Fisheries and Aquatic Sciences 70 (1): 128–41. doi:10.1139/cjfas-2012-0233.

Moran, B.M., and E.C. Anderson. 2019. “Bayesian Inference from the Conditional Genetic Stock Identification Model.” Canadian Journal of Fisheries and Aquatic Sciences 76 (4): 551–60. doi:10.1139/cjfas-2018-0016.

Pella, J AND Masuda M. 2001. “Bayesian Methods for Analysis of Stock Mixtures from Genetic Characters.” Fisheries Bulletin 99: 151–67.

Petrou, E.L., J.E. Seeb, L. Hauser, M.J. Witteveen, W.D. Templin, and L.W. Seeb. 2014. “Fine-Scale Sampling Reveals Distinct Isolation by Distance Patterns in Chum Salmon (Oncorhynchus Keta) Populations Occupying a Glacially Dynamic Environment.” Conservation Genetics 15 (1): 229–43. doi:10.1007/s10592-013-0534-3.

Price, M.A. 2001. “Abundance and Run Timing of Adult Chum Salmon, and Steelhead Kelt Emigration in the Sturgeon River, Kodiak, Alaska, 1999.” Alaska Fisheries Data Series 2001-2. U.S. Fish; Wildlife Service.

Seeb, L.W., W.D. Templin, S. Sato, S. Abe, K. Warheit, J.Y. Park, and J.E. Seeb. 2011. “Single Nucleotide Polymorphisms Across a Species’ Range: Implications for Conservation Studies of Pacific Salmon.” Molecular Ecology Resources 11 (SUPPL. 1): 195–217. doi:10.1111/j.1755-0998.2010.02966.x.

Templin, W.D., J.E. Seeb, J.R. Jasper, A.W. Barclay, and L.W. Seeb. 2011. “Genetic Differentiation of Alaska Chinook Salmon: The Missing Link for Migratory Studies.” Molecular Ecology Resources 11 (SUPPL. 1): 226–46. doi:10.1111/j.1755-0998.2010.02968.x.